The present invention relates to a system, a method and a program for acquiring a character string and the like that should be newly recognized as a word. Particularly, the present invention relates to a system, a method and a program for acquiring, for speech processing, a set of a character string and a pronunciation that should be recognized as a word.
In a large vocabulary continuous speech recognition (LVCSR) system, highly accurate speech recognition requires a word dictionary in which the words and phrases included in the speech are recorded, and a language model from which the appearance frequency and the like of each word or phrase can be derived. To improve recognition accuracy, it is desirable that the word dictionary and the language model comprehensively contain the words included in the speech to be recognized. On the other hand, because both the capacity of the storage device that stores the dictionary and the like and the performance of the CPU that calculates frequency values are limited, it is also desirable that the word dictionary and the language model be kept minimal, so that they contain no unnecessary words.
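As a purely illustrative sketch of the two resources described above, a word dictionary can be modeled as a mapping from a word to its pronunciation, and a language model as relative appearance frequencies estimated from a tokenized corpus. The entries, the phone notation, and the names `word_dictionary` and `build_unigram_model` are assumptions made for this example; a practical LVCSR system would typically use a far richer model than the unigram counts shown here.

```python
from collections import Counter

# Hypothetical word dictionary: word -> pronunciation (illustrative entries only).
word_dictionary = {
    "speech": "s p iy ch",
    "recognition": "r eh k ah g n ih sh ah n",
    "system": "s ih s t ah m",
}


def build_unigram_model(corpus_tokens, dictionary):
    """Return (relative frequencies of in-dictionary tokens, sorted out-of-vocabulary tokens)."""
    known = [t for t in corpus_tokens if t in dictionary]
    unknown = sorted({t for t in corpus_tokens if t not in dictionary})
    counts = Counter(known)
    total = sum(counts.values())
    # Relative frequency of each known word; empty if nothing in the corpus is known.
    model = {w: c / total for w, c in counts.items()} if total else {}
    return model, unknown


tokens = "speech recognition system speech lvcsr".split()
model, oov = build_unigram_model(tokens, word_dictionary)
```

In this sketch, any token missing from the dictionary ends up in `oov` and contributes nothing to the frequencies, which mirrors why the dictionary and the language model should cover the vocabulary of the speech to be recognized.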
However, enormous time, effort and expense are required to construct manually even a dictionary containing only a minimum of words and phrases. More specifically, when a dictionary is constructed from Japanese texts, for example, it is first necessary to analyze the segmentation of the text into words, and then to assign a correct pronunciation to each of the segmented words. Since a pronunciation is information on how a word is read, expressed with phonetic symbols and the like, expert linguistic knowledge is in some cases necessary to assign it. Such work and expense can be a problem particularly when speech recognition is attempted in a specific field of expertise. This is because accumulated information such as a general dictionary is not very useful in such a field, and also because sufficient time, effort and expense cannot be spent due to low demand.
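The two manual steps mentioned above, segmentation followed by pronunciation assignment, can be sketched as follows. This is only a simplified illustration that assumes a small hypothetical lexicon and uses greedy longest-match segmentation, a technique chosen here for brevity and not taken from the present description. Characters not covered by the lexicon are emitted with no pronunciation, which corresponds to the cases that would still need expert attention.

```python
def segment_and_pronounce(text, lexicon):
    """Greedy longest-match segmentation; returns a list of (word, pronunciation or None)."""
    max_len = max(map(len, lexicon), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon entry matches.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in lexicon:
                result.append((piece, lexicon[piece]))
                i += n
                break
        else:
            # No entry matched: keep the single character with an unknown pronunciation.
            result.append((text[i], None))
            i += 1
    return result


# Hypothetical lexicon: word -> reading in hiragana (illustrative entries only).
lexicon = {"音声": "おんせい", "認識": "にんしき"}
segments = segment_and_pronounce("音声認識器", lexicon)
```

In a specialized field, many words would fall into the unmatched branch, which is where a general dictionary stops being useful.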